Skip to content

[hub] fix kernels fp8#9271

Open
Jintao-Huang wants to merge 2 commits into
modelscope:mainfrom
Jintao-Huang:fix_kernels
Open

[hub] fix kernels fp8#9271
Jintao-Huang wants to merge 2 commits into
modelscope:mainfrom
Jintao-Huang:fix_kernels

Conversation

@Jintao-Huang
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the transformers dependency version and introduces a _patch_kernels context manager in transformers_engine.py to handle modelscope hub patching during inference. The review feedback points out critical flaws in the context manager's implementation, including a lack of exception safety that could leave the global state inconsistent and potential thread-safety issues when processing concurrent requests.

Comment on lines +390 to +403
def _patch_kernels(self):
use_hf = self.use_hf
if use_hf is None:
use_hf = True if use_hf_hub() else False
if not use_hf:
try:
from modelscope import patch_hub, unpatch_hub
except ImportError:
use_hf = True
if not use_hf:
patch_hub()
yield
if not use_hf:
unpatch_hub()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The _patch_kernels context manager has two significant issues:

  1. Exception Safety: It lacks a try...finally block around the yield. If an exception occurs during generation (e.g., in self.template.generate), unpatch_hub() will never be called, leaving the global environment in a patched state.
  2. Thread Safety: patch_hub and unpatch_hub modify global state in modelscope/transformers. Since _infer_stream executes generation in a separate thread, concurrent requests can lead to race conditions where one thread's cleanup unpatches the hub while another thread is still using it.

Consider using a try...finally block for safety. For thread safety, if this engine is used in a concurrent environment, a global lock and reference counter might be necessary, or simply patching once during initialization if the mode is consistent for the process.

    def _patch_kernels(self):
        use_hf = self.use_hf if self.use_hf is not None else use_hf_hub()
        if use_hf:
            yield
            return

        try:
            from modelscope import patch_hub, unpatch_hub
        except ImportError:
            yield
            return

        patch_hub()
        try:
            yield
        finally:
            unpatch_hub()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants